- Workflow is really important.
- There are ideas and techniques that will help you.
- You usually figure them out only after many painful mistakes.
Let's aim to avoid painful mistakes!
2015-07-21
After this class, students will be able to:
Weitzman (1985) reported that after divorce, the standard of living for women decreases 73% while for men it increases 42%
This finding led to changes in divorce law in California
"First, let me begin with Peterson's implied question: Was this responsible research and did I meet professional standards in analyzing these data?" Weitzman (1996)
". . . .Changes to the original raw data file resulting from this data cleaning process were made by a series of programming statements on a master SPSS system file. The raw data file that is stored at the Murray Center is the original 'dirty data' file and does not include these cleaning changes. . . ." Weitzman (1996)
"Unfortunately, the original cleaned master SPSS system file no longer exists. I assumed it was being copied and reformatted as I moved for job changes and fellowships from the project's original offices in Berkeley to Stanford (in 1979), then to Princeton (in 1983), back to Stanford (in 1984) and then to Harvard (in 1986). With each move, new programmers worked on the files to accommodate different computer systems." Weitzman (1996)
"Before I left Stanford I instructed my programmers to prepare all my data files for archiving. I know now (but did not know then) that the original master SPSS system file that I used for my book had been lost or damaged at some point and was not included among these files. The SPSS system file that I thought was the master SPSS system file was the result of the merging of many smaller subfiles that had been created for specific analyses. It later became apparent that a programming error had been made, and the subfiles were not "keyed" correctly: Not all of the data from each individual respondent were matched on the appropriate case ID number, and data from different respondents were merged under the same case ID. At present it is not possible to disentangle exactly what mismatch occurred for any specific respondent." Weitzman (1996)
"When I could not replicate the analyses in my book with what I had mistakenly assumed was the archived master SPSS system file, I hired an independent consultant, Professor Angela Aidala from Columbia University, to help me untangle what had happened. She reviewed all of the project files, documentation, and codebooks, as well as the available data and programming files to determine a possible computational error in the standard of living statistic. But she could not do this without an accurate data file to work with. We then went back to the original questionnaires and recoded a random sample of about 25 percent of the cases. There were so many discrepancies between the questionnaires and the 'dirty data' raw data file, and between the questionnaires and the mismatched SPSS system file, that we finally abandoned the effort and left a warning to all future researchers that both files at the Murray Center were so seriously flawed that they could not be used. It was a very sad, time consuming, and frustrating experience. . ." Weitzman (1996)
If you find yourself staring at a folder like this…
Automation (as opposed to point-and-click) enables:
If you find yourself doing the same thing over and over, then you should consider automation. [Any examples?]
High short-term startup costs
High long-term benefits
Doesn't have to be fancy
Might be simple things like designing a report so that running a new model automatically repopulates the tables and figures.
Not always the right answer, but always worth considering
Do you think they use that system at Google?
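The report-repopulation idea above can be sketched in a few lines of base R. This is a minimal illustration, not a real reporting pipeline: the data, file names, and output folder are all invented, and `tempdir()` stands in for a project's output directory.

```r
# A minimal sketch of automation (all names invented): re-running this
# script regenerates the table and the figure, so nothing is updated by hand.
results <- data.frame(country = c("us", "uk"), effect = c(0.42, 0.37))

out.dir <- tempdir()  # a real project would use a fixed output/ folder

# the table and figure repopulate automatically on every run
write.csv(results, file.path(out.dir, "results-table.csv"), row.names = FALSE)
pdf(file.path(out.dir, "results-figure.pdf"))
barplot(results$effect, names.arg = results$country)
dev.off()
```

Swap in new results and re-run: every downstream artifact updates, with no copy-and-paste into the report.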
Version control lets you edit the same document while preserving the ability to "go back"
We will use Git and GitHub.
Main points:
Keys are really important:
Question: If we were going to make a database of information about students in this class, is first_name a good primary key?
Answer: No, might not be unique.
Question: If we were going to make a database of information about students in this class, are first.name and last.name together a good primary key?
Answer: Better, but two students could still share both names, and what if we want to add to the database over time?
Question: If we were going to make a database of information about students in this class, is UGA ID number a good primary key?
Answer: Yes, unique and defined for each person.
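You can check a candidate key directly in R with `anyDuplicated()`, which returns 0 when every value (or row) is unique. The toy data below is invented for illustration:

```r
# Toy student table (all values made up) for checking candidate keys
students <- data.frame(first.name = c("Ana", "Ana", "Ben"),
                       last.name  = c("Lee", "Kim", "Lee"),
                       uga.id     = c(810001, 810002, 810003))

anyDuplicated(students$first.name)                     # 2: first name repeats
anyDuplicated(students[c("first.name", "last.name")])  # 0 here, but not guaranteed
anyDuplicated(students$uga.id)                         # 0: unique, works as a key
```

A nonzero result tells you exactly which row first collides with an earlier one.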
"a philosophy of data" that underlies many of our tools (e.g., dplyr)
In tidy data:
- each variable forms a column
- each observation forms a row
- each type of observational unit forms a table
Tidy is not what the data looks like to your eye; it is what the data looks like to the computer
Store data for computers not people
Not tidy.
The three variables are religion, income, and frequency.
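Making a table like that tidy is one reshaping call with tidyr. The column names and counts below are invented stand-ins for the real religion-by-income table (older versions of tidyr use `gather()` instead of `pivot_longer()`):

```r
library(tidyr)

# A small wide table (invented counts): one column per income bracket
pew.wide <- data.frame(religion  = c("Agnostic", "Atheist"),
                       `<$10k`   = c(27, 12),
                       `$10-20k` = c(34, 27),
                       check.names = FALSE)

# Tidy form: one row per (religion, income) pair, with a frequency column
pew.tidy <- pivot_longer(pew.wide, cols = -religion,
                         names_to = "income", values_to = "frequency")
pew.tidy
```

The income brackets stop being column headers and become values of an `income` variable, which is what dplyr-style tools expect.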
Example from Jenny Bryan
Q: Is this data tidy?
A: No.
What's the total number of words spoken by male hobbits?
lotr <- read.csv("data/lotr_tidy.csv", header=TRUE)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
lotr %>%
  filter(Gender == "Male", Race == "Hobbit") %>%
  summarise(total = sum(Words))
##   total
## 1  8780
How well does your approach scale if there were many more movies or if I provided you with updated data that includes all the Races (e.g., dwarves, orcs)?
words_by_film_race <- lotr %>%
  group_by(Film, Race) %>%
  summarise(Total.Words = sum(Words))
words_by_film_race
## Source: local data frame [9 x 3]
## Groups: Film
##
##                         Film   Race Total.Words
## 1 The Fellowship Of The Ring    Elf        2200
## 2 The Fellowship Of The Ring Hobbit        3658
## 3 The Fellowship Of The Ring    Man        1995
## 4     The Return Of The King    Elf         693
## 5     The Return Of The King Hobbit        2675
## 6     The Return Of The King    Man        2727
## 7             The Two Towers    Elf         844
## 8             The Two Towers Hobbit        2463
## 9             The Two Towers    Man        3990
Tall, skinny data looks bad to the eye but works better for computers
Short and wide is good for looking at, bad for computers
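When you do want the short-and-wide view for your eyes, reshape on demand rather than storing the data that way. A sketch using tidyr and a subset of the word counts from the table above:

```r
library(tidyr)

# Tall, skinny form (values taken from the LOTR word counts above)
tall <- data.frame(Film  = rep(c("The Fellowship Of The Ring",
                                 "The Two Towers"), each = 2),
                   Race  = rep(c("Elf", "Hobbit"), 2),
                   Words = c(2200, 3658, 844, 2463))

# Short, wide form: one column per Race, easier to scan by eye
wide <- pivot_wider(tall, names_from = Race, values_from = Words)
wide
```

Store tall for computing; pivot to wide only at the moment of presentation.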
Questions about Keys, databases, and data structure
abstraction: "turning the specific instances of something into a general-purpose tool"
More concretely
Abstraction takes practice, and we will work on it all semester
Three techniques we will use to promote abstraction
Loops help you do the same thing over and over
model.us <- lm(us$income ~ us$edu)
model.us <- lm(us$income ~ us$edu)
model.uk <- lm(uk$income ~ uk$edu)
. . .
model.us <- lm(us$income ~ us$edu)
model.uk <- lm(uk$income ~ uk$edu)
model.de <- lm(de$income ~ de$edu)
model.fr <- lm(fr$income ~ fr$edu)
model.jp <- lm(jp$income ~ jp$edu)
model.cn <- lm(cn$income ~ cn$edu)
countries <- c("us", "uk", "de", "fr", "jp", "cn")
model <- list()
for (country.code in countries) {
data.this.country <- filter(all.data, country==country.code)
model[[country.code]] <- lm(data.this.country$income ~ data.this.country$edu)
}
Questions about loops?
functions take inputs and return outputs
total <- sum(c(1, 2, 3))
functions are like you building your own tools
all complicated pieces of software use functions
analytic.data <- run.processing.code(measured.data)
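A sketch of what such a processing function might look like. The cleaning rule (drop missing or negative incomes) and the data are invented for illustration, and the name follows the function-naming convention we adopt below:

```r
# Hypothetical processing step: takes measured data, returns analytic data.
RunProcessingCode <- function(measured.data) {
  # Drop rows with missing or negative income, a stand-in for real cleaning
  subset(measured.data, !is.na(income) & income >= 0)
}

measured.data <- data.frame(income = c(30, NA, 40, -1, 50))
analytic.data <- RunProcessingCode(measured.data)
nrow(analytic.data)  # 3
```

Once the cleaning logic lives in one function, every analysis that needs it calls the same tool instead of re-copying the code.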
Questions about functions?
Refactoring is a fancy name for rewriting your code to improve its structure without changing what it does.
Often you don't know the right abstractions at the beginning.
It is almost always worth it to take a few hours and rewrite your code with the right abstractions. This happens all the time in real projects.
Questions about abstraction?
If you want to be able to reproduce your results 10 years from now, you need to write documentation.
Questions about documentation?
Every large software project uses task-management software.
Management and collaboration are related: we're going to use Git and GitHub.
Why care about code style?
fewer errors
easier to change
better for collaboration
reproducible
open
Our first step to writing beautiful code is to pick a set of code conventions.
All code in this class will follow Google's R Style Guide. These are the rules developed and used at Google so that they can write beautiful code.
Google's success criteria: "Any programmer should be able to instantly understand structure of any code"
For more on Google's R Style Guide see Andy Chen's presentation at useR 2014.
File Names: File names should end in .R and, of course, be meaningful.
Identifiers: Don't use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.
Variable names: The preferred form is all lower case letters and words separated with dots (variable.name), but variableName is also accepted.
Function names: Function names have initial capital letters and no dots (FunctionName); constants are named like functions but with an initial k.
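The conventions look like this in practice (all names here are invented examples):

```r
kMaxIterations <- 100            # constant: named like a function, with initial k

CalculateMean <- function(x) {   # function: initial capitals, no dots
  sum(x) / length(x)
}

sample.values <- c(1, 2, 3)      # variable: lower case, words separated by dots
mean.value <- CalculateMean(sample.values)  # 2
```

Consistent naming means a reader can tell at a glance whether a name is a variable, a function, or a constant.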
Code style is something we will work on throughout the semester. Getting the code conventions right is just the first step.
Just like learning to write, learning to code is hard, important, and possible.
For more, see Google's R Style Guide
Gentzkow and Shapiro:
Questions and comments?
Goal check
Summary of material for next class